For the take-home part of the MSDS 401 Final Exam, you are tasked with analyzing data on new daily COVID-19 cases and deaths in European Union (EU) and European Economic Area (EEA) countries. A data file may be downloaded here, or you may use the provided read.csv() code in the ‘setup’ code chunk below to read the data directly from the web. Either approach is acceptable; the data should be the same.

Once you have defined a data frame with the daily case, death, and country data, you are asked to: (1) perform an Exploratory Data Analysis (EDA), (2) perform some hypothesis testing, (3) perform some correlation testing, and (4) fit and describe a linear regression model. Each of these four (4) items is further explained below, and “code chunks” have been created for you in which to add your R code, just as with the R and Data Analysis Assignments. You may add additional code chunks as needed. You should add comments in the code chunks, or clarifying text between code chunks, wherever you think it furthers your work.

A data dictionary for the dataset is available here.

Definitions:


1. Descriptive Statistics

Perform an Exploratory Data Analysis (EDA). Your EDA is exactly that: yours. Your knit .html should include the visualizations and summary tables that you find valuable while exploring this dataset. However, at minimum, your EDA must include the following:

# create incidence and fatality rate columns (cases and deaths per 100,000 population)
data$Incidence_rate <- (data$cases / data$popData2020) * 100000
data$Fatality_rate <- (data$deaths / data$popData2020) * 100000
head(data)
##      dateRep day month year cases deaths countriesAndTerritories geoId
## 1 2022-10-23  23    10 2022  3557      0                 Austria    AT
## 2 2022-10-22  22    10 2022  5494      4                 Austria    AT
## 3 2022-10-21  21    10 2022  7776      4                 Austria    AT
## 4 2022-10-20  20    10 2022  8221      6                 Austria    AT
## 5 2022-10-19  19    10 2022 10007      8                 Austria    AT
## 6 2022-10-18  18    10 2022 13204      7                 Austria    AT
##   countryterritoryCode popData2020 Incidence_rate Fatality_rate
## 1                  AUT     8901064       39.96151    0.00000000
## 2                  AUT     8901064       61.72296    0.04493845
## 3                  AUT     8901064       87.36034    0.04493845
## 4                  AUT     8901064       92.35974    0.06740767
## 5                  AUT     8901064      112.42476    0.08987690
## 6                  AUT     8901064      148.34182    0.07864228
summary(data)
##     dateRep                day            month             year     
##  Min.   :2020-01-01   Min.   : 1.00   Min.   : 1.000   Min.   :2020  
##  1st Qu.:2020-10-17   1st Qu.: 8.00   1st Qu.: 4.000   1st Qu.:2020  
##  Median :2021-06-17   Median :16.00   Median : 6.000   Median :2021  
##  Mean   :2021-06-17   Mean   :15.68   Mean   : 6.431   Mean   :2021  
##  3rd Qu.:2022-02-13   3rd Qu.:23.00   3rd Qu.: 9.000   3rd Qu.:2022  
##  Max.   :2022-10-26   Max.   :31.00   Max.   :12.000   Max.   :2022  
##                                                                      
##      cases             deaths         countriesAndTerritories     geoId      
##  Min.   :-348846   Min.   : -217.00   Finland  : 1024         FI     : 1024  
##  1st Qu.:    111   1st Qu.:    0.00   France   : 1006         FR     : 1006  
##  Median :    705   Median :    5.00   Czechia  : 1003         CZ     : 1003  
##  Mean   :   6088   Mean   :   40.87   Lithuania:  997         LT     :  997  
##  3rd Qu.:   3483   3rd Qu.:   31.00   Germany  :  992         DE     :  992  
##  Max.   : 501635   Max.   :13743.00   Sweden   :  982         SE     :  982  
##  NA's   :93        NA's   :292        (Other)  :22725         (Other):22725  
##  countryterritoryCode  popData2020       Incidence_rate     Fatality_rate      
##  FIN    : 1024        Min.   :   38747   Min.   :-518.189   Min.   : -0.32234  
##  FRA    : 1006        1st Qu.: 2095861   1st Qu.:   2.574   1st Qu.:  0.00000  
##  CZE    : 1003        Median : 6951482   Median :  13.119   Median :  0.08189  
##  LTU    :  997        Mean   :15348035   Mean   :  42.502   Mean   :  0.26944  
##  DEU    :  992        3rd Qu.:11522440   3rd Qu.:  41.035   3rd Qu.:  0.32034  
##  SWE    :  982        Max.   :83166711   Max.   :3785.420   Max.   :128.21679  
##  (Other):22725                           NA's   :93         NA's   :292
str(data)
## 'data.frame':    28729 obs. of  12 variables:
##  $ dateRep                : Date, format: "2022-10-23" "2022-10-22" ...
##  $ day                    : int  23 22 21 20 19 18 17 16 15 14 ...
##  $ month                  : int  10 10 10 10 10 10 10 10 10 10 ...
##  $ year                   : int  2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...
##  $ cases                  : int  3557 5494 7776 8221 10007 13204 9964 6606 8818 11751 ...
##  $ deaths                 : int  0 4 4 6 8 7 8 12 6 10 ...
##  $ countriesAndTerritories: Factor w/ 30 levels "Austria","Belgium",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ geoId                  : Factor w/ 30 levels "AT","BE","BG",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ countryterritoryCode   : Factor w/ 30 levels "AUT","BEL","BGR",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ popData2020            : int  8901064 8901064 8901064 8901064 8901064 8901064 8901064 8901064 8901064 8901064 ...
##  $ Incidence_rate         : num  40 61.7 87.4 92.4 112.4 ...
##  $ Fatality_rate          : num  0 0.0449 0.0449 0.0674 0.0899 ...
# drop rows with missing values (93 NA cases, 292 NA deaths)
data <- na.omit(data)
summary(data)
##     dateRep                day            month             year     
##  Min.   :2020-01-01   Min.   : 1.00   Min.   : 1.000   Min.   :2020  
##  1st Qu.:2020-10-16   1st Qu.: 8.00   1st Qu.: 4.000   1st Qu.:2020  
##  Median :2021-06-16   Median :16.00   Median : 6.000   Median :2021  
##  Mean   :2021-06-16   Mean   :15.68   Mean   : 6.441   Mean   :2021  
##  3rd Qu.:2022-02-13   3rd Qu.:23.00   3rd Qu.: 9.000   3rd Qu.:2022  
##  Max.   :2022-10-26   Max.   :31.00   Max.   :12.000   Max.   :2022  
##                                                                      
##      cases             deaths         countriesAndTerritories     geoId      
##  Min.   :-348846   Min.   : -217.00   Finland  : 1024         FI     : 1024  
##  1st Qu.:    119   1st Qu.:    0.00   France   : 1006         FR     : 1006  
##  Median :    723   Median :    5.00   Czechia  : 1002         CZ     : 1002  
##  Mean   :   6149   Mean   :   40.91   Lithuania:  997         LT     :  997  
##  3rd Qu.:   3554   3rd Qu.:   31.00   Germany  :  992         DE     :  992  
##  Max.   : 501635   Max.   :13743.00   Sweden   :  982         SE     :  982  
##                                       (Other)  :22341         (Other):22341  
##  countryterritoryCode  popData2020       Incidence_rate     Fatality_rate      
##  FIN    : 1024        Min.   :   38747   Min.   :-518.189   Min.   : -0.32234  
##  FRA    : 1006        1st Qu.: 2095861   1st Qu.:   2.577   1st Qu.:  0.00000  
##  CZE    : 1002        Median : 6951482   Median :  13.184   Median :  0.08189  
##  LTU    :  997        Mean   :15449398   Mean   :  42.152   Mean   :  0.27001  
##  DEU    :  992        3rd Qu.:11522440   3rd Qu.:  40.979   3rd Qu.:  0.32111  
##  SWE    :  982        Max.   :83166711   Max.   :3785.420   Max.   :128.21679  
##  (Other):22341
str(data)
## 'data.frame':    28344 obs. of  12 variables:
##  $ dateRep                : Date, format: "2022-10-23" "2022-10-22" ...
##  $ day                    : int  23 22 21 20 19 18 17 16 15 14 ...
##  $ month                  : int  10 10 10 10 10 10 10 10 10 10 ...
##  $ year                   : int  2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...
##  $ cases                  : int  3557 5494 7776 8221 10007 13204 9964 6606 8818 11751 ...
##  $ deaths                 : int  0 4 4 6 8 7 8 12 6 10 ...
##  $ countriesAndTerritories: Factor w/ 30 levels "Austria","Belgium",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ geoId                  : Factor w/ 30 levels "AT","BE","BG",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ countryterritoryCode   : Factor w/ 30 levels "AUT","BEL","BGR",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ popData2020            : int  8901064 8901064 8901064 8901064 8901064 8901064 8901064 8901064 8901064 8901064 ...
##  $ Incidence_rate         : num  40 61.7 87.4 92.4 112.4 ...
##  $ Fatality_rate          : num  0 0.0449 0.0449 0.0674 0.0899 ...
##  - attr(*, "na.action")= 'omit' Named int [1:385] 1433 1434 1436 1440 1442 1446 1808 1928 1931 1932 ...
##   ..- attr(*, "names")= chr [1:385] "1433" "1434" "1436" "1440" ...

We found that some values of cases and deaths are negative (likely reporting corrections), so we filter out all non-positive values, keeping only days with at least one reported case and one reported death.
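A quick tally of the sign of each count shows how many rows this filter will remove — a sketch, assuming the `data` frame created in the setup chunk above:

```r
# Tally the sign of the daily counts: -1 = negative (corrections),
# 0 = zero-count days, 1 = positive. Sketch; assumes `data` as defined above.
table(sign(data$cases))
table(sign(data$deaths))
```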

# keep only rows with cases > 0 and deaths > 0

data <- data %>%
  filter(cases > 0) %>%
  filter(deaths > 0)
# A visualization exploring new cases, per country, over time.
top_mean_cases <- data %>%
  group_by(countriesAndTerritories) %>%
  summarise_at(vars(cases), list(name = mean)) %>%
  arrange(desc(name)) %>%
  slice_head(n = 7) %>%    
  pull(countriesAndTerritories) 

top_mean_cases
## [1] France      Germany     Italy       Spain       Greece      Netherlands
## [7] Poland     
## 30 Levels: Austria Belgium Bulgaria Croatia Cyprus Czechia Denmark ... Sweden
top_countries_cases <- data %>%
  filter(countriesAndTerritories %in% top_mean_cases)
head(top_countries_cases)
##      dateRep day month year cases deaths countriesAndTerritories geoId
## 1 2022-10-24  24    10 2022  6907    122                  France    FR
## 2 2022-10-23  23    10 2022 31470     17                  France    FR
## 3 2022-10-22  22    10 2022 43746     33                  France    FR
## 4 2022-10-21  21    10 2022 49087     81                  France    FR
## 5 2022-10-20  20    10 2022 56793     69                  France    FR
## 6 2022-10-19  19    10 2022 63031    106                  France    FR
##   countryterritoryCode popData2020 Incidence_rate Fatality_rate
## 1                  FRA    67320216       10.25992    0.18122342
## 2                  FRA    67320216       46.74673    0.02525244
## 3                  FRA    67320216       64.98197    0.04901945
## 4                  FRA    67320216       72.91569    0.12032047
## 5                  FRA    67320216       84.36247    0.10249521
## 6                  FRA    67320216       93.62864    0.15745642
p <- ggplot(top_countries_cases, aes(x = round_date(as.POSIXct(dateRep), "days"), 
                                 y = cases, color = countriesAndTerritories)) +
  geom_line() + 
  xlab("")+
  theme_classic()+
  labs(title = "New Cases From 2020 to 2022 ",
         x = "Date",
         y = "New Cases")+
  theme(plot.title = element_text(hjust = 0.5, size = 15))
p

In the new-cases plot, France and the Netherlands show the highest daily case counts in 2022, indicating major COVID-19 waves in those countries that year.
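Daily counts are noisy (weekend and batch-reporting effects), so a 7-day rolling mean can make the country curves easier to compare. A sketch, assuming the `zoo` package is available and `top_countries_cases` is defined as above:

```r
library(dplyr)
library(zoo)  # assumption: zoo is installed; provides rollmean()

# Add a trailing 7-day rolling mean of daily cases within each country
smoothed <- top_countries_cases %>%
  group_by(countriesAndTerritories) %>%
  arrange(dateRep, .by_group = TRUE) %>%
  mutate(cases_7day = zoo::rollmean(cases, k = 7, fill = NA, align = "right")) %>%
  ungroup()
```

Mapping `cases_7day` to the y aesthetic in the plot above would give smoother, more comparable curves.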

Visuals for Incidence Rate

top_mean_incidence <- data %>%
  group_by(countriesAndTerritories) %>%
  summarise_at(vars(Incidence_rate), list(name = mean)) %>%
  arrange(desc(name)) %>%
  slice_head(n = 7) %>%    
  pull(countriesAndTerritories) 

top_mean_incidence
## [1] Iceland       Cyprus        Greece        Liechtenstein Estonia      
## [6] Slovenia      Denmark      
## 30 Levels: Austria Belgium Bulgaria Croatia Cyprus Czechia Denmark ... Sweden
top_countries_incidence <- data %>%
  filter(countriesAndTerritories %in% top_mean_incidence)
head(top_countries_incidence)
##      dateRep day month year cases deaths countriesAndTerritories geoId
## 1 2022-10-20  20    10 2022  2755      2                  Cyprus    CY
## 2 2022-10-13  13    10 2022  2759      2                  Cyprus    CY
## 3 2022-10-06   6    10 2022  2789      5                  Cyprus    CY
## 4 2022-09-29  29     9 2022  2681      2                  Cyprus    CY
## 5 2022-09-22  22     9 2022  2932      2                  Cyprus    CY
## 6 2022-09-15  15     9 2022  2482      5                  Cyprus    CY
##   countryterritoryCode popData2020 Incidence_rate Fatality_rate
## 1                  CYP      888005       310.2460     0.2252240
## 2                  CYP      888005       310.6964     0.2252240
## 3                  CYP      888005       314.0748     0.5630599
## 4                  CYP      888005       301.9127     0.2252240
## 5                  CYP      888005       330.1783     0.2252240
## 6                  CYP      888005       279.5029     0.5630599
p <- ggplot(top_countries_incidence, aes(x = round_date(as.POSIXct(dateRep), "days"), 
                                 y = Incidence_rate, color = countriesAndTerritories)) +
  geom_line() + 
  xlab("")+
  theme_classic()+
  labs(title = "Incidence Rate From 2020 to 2022 ",
         x = "Date",
         y = "Incidence Rate")+
  theme(plot.title = element_text(hjust = 0.5, size = 15))
p

Iceland has the highest incidence rate in 2022.

p <- ggplot(top_countries_incidence, aes(x = countriesAndTerritories, y = Incidence_rate))+
  geom_boxplot()
p

As we can see, there are many extreme outliers in the boxplot. Let’s try a log transformation.

p <- ggplot(top_countries_incidence, aes(x = round_date(as.POSIXct(dateRep), "days"), 
                                 y = log10(Incidence_rate), color = countriesAndTerritories)) +
  geom_line() + 
  xlab("")+
  theme_classic()+
  labs(title = "Incidence Rate From 2020 to 2022 ",
         x = "Date",
         y = "Log Incidence Rate")+
  theme(plot.title = element_text(hjust = 0.5, size = 15))
p

p <- ggplot(top_countries_incidence, aes(x = countriesAndTerritories, y = log10(Incidence_rate)))+
  geom_boxplot()
p

The log transformation doesn’t help: it compresses the large outliers but creates many extremely small ones. Let’s inspect countries individually:

top_mean_incidence
## [1] Iceland       Cyprus        Greece        Liechtenstein Estonia      
## [6] Slovenia      Denmark      
## 30 Levels: Austria Belgium Bulgaria Croatia Cyprus Czechia Denmark ... Sweden
Cyprus <- data %>%
  filter(countriesAndTerritories == 'Cyprus')
p <- ggplot(Cyprus, aes(x = round_date(as.POSIXct(dateRep), "days"), 
                                 y = Incidence_rate)) +
  geom_line(color = 'blue') + 
  xlab("")+
  theme_classic()+
  labs(title = " Cyprus Incidence Rate From 2020 to 2022 ",
         x = "Date",
         y = "Incidence Rate")+
  theme(plot.title = element_text(hjust = 0.5, size = 15))
p

p <- ggplot(Cyprus, aes(x = countriesAndTerritories, y = Incidence_rate))+
  geom_boxplot()
p

Find extreme outliers:

quantile(Cyprus$Incidence_rate, probs = 0.75)
##      75% 
## 204.2218
# extreme-outlier threshold: Q3 + 3 * IQR
upper <- quantile(Cyprus$Incidence_rate, probs = 0.75) + 3 * IQR(Cyprus$Incidence_rate)
extreme_outliers <- Cyprus[Cyprus$Incidence_rate > upper, ]
extreme_outliers
##       dateRep day month year cases deaths countriesAndTerritories geoId
## 13 2022-07-28  28     7 2022  6863     16                  Cyprus    CY
## 14 2022-07-21  21     7 2022 10152     13                  Cyprus    CY
## 15 2022-07-14  14     7 2022 15386      7                  Cyprus    CY
## 16 2022-07-07   7     7 2022 14914      4                  Cyprus    CY
## 17 2022-06-30  30     6 2022 10879      3                  Cyprus    CY
## 18 2022-06-23  23     6 2022  7263      2                  Cyprus    CY
## 25 2022-05-05   5     5 2022  9559     12                  Cyprus    CY
## 27 2022-04-21  21     4 2022  6115     18                  Cyprus    CY
## 45 2022-03-28  28     3 2022  6332      3                  Cyprus    CY
##    countryterritoryCode popData2020 Incidence_rate Fatality_rate
## 13                  CYP      888005       772.8560     1.8017917
## 14                  CYP      888005      1143.2368     1.4639557
## 15                  CYP      888005      1732.6479     0.7882838
## 16                  CYP      888005      1679.4950     0.4504479
## 17                  CYP      888005      1225.1057     0.3378359
## 18                  CYP      888005       817.9008     0.2252240
## 25                  CYP      888005      1076.4579     1.3513437
## 27                  CYP      888005       688.6222     2.0270156
## 45                  CYP      888005       713.0590     0.3378359
summary(extreme_outliers)
##     dateRep                day            month            year     
##  Min.   :2022-03-28   Min.   : 5.00   Min.   :3.000   Min.   :2022  
##  1st Qu.:2022-05-05   1st Qu.:14.00   1st Qu.:5.000   1st Qu.:2022  
##  Median :2022-06-30   Median :21.00   Median :6.000   Median :2022  
##  Mean   :2022-06-12   Mean   :19.67   Mean   :5.778   Mean   :2022  
##  3rd Qu.:2022-07-14   3rd Qu.:28.00   3rd Qu.:7.000   3rd Qu.:2022  
##  Max.   :2022-07-28   Max.   :30.00   Max.   :7.000   Max.   :2022  
##                                                                     
##      cases           deaths       countriesAndTerritories     geoId  
##  Min.   : 6115   Min.   : 2.000   Cyprus  :9              CY     :9  
##  1st Qu.: 6863   1st Qu.: 3.000   Austria :0              AT     :0  
##  Median : 9559   Median : 7.000   Belgium :0              BE     :0  
##  Mean   : 9718   Mean   : 8.667   Bulgaria:0              BG     :0  
##  3rd Qu.:10879   3rd Qu.:13.000   Croatia :0              CZ     :0  
##  Max.   :15386   Max.   :18.000   Czechia :0              DE     :0  
##                                   (Other) :0              (Other):0  
##  countryterritoryCode  popData2020     Incidence_rate   Fatality_rate   
##  CYP    :9            Min.   :888005   Min.   : 688.6   Min.   :0.2252  
##  AUT    :0            1st Qu.:888005   1st Qu.: 772.9   1st Qu.:0.3378  
##  BEL    :0            Median :888005   Median :1076.5   Median :0.7883  
##  BGR    :0            Mean   :888005   Mean   :1094.4   Mean   :0.9760  
##  CZE    :0            3rd Qu.:888005   3rd Qu.:1225.1   3rd Qu.:1.4640  
##  DEU    :0            Max.   :888005   Max.   :1732.6   Max.   :2.0270  
##  (Other):0
table(extreme_outliers$month)
## 
## 3 4 5 6 7 
## 1 1 1 2 4
table(extreme_outliers$year)
## 
## 2022 
##    9

The extreme outliers in Cyprus occurred mainly in the spring and summer of 2022 (March through July).

Visuals for Fatality Rate:

# A visualization exploring fatality rate, per country, over time.
top_mean_fatality <- data %>%
  group_by(countriesAndTerritories) %>%
  summarise_at(vars(Fatality_rate), list(name = mean)) %>%
  arrange(desc(name)) %>%
  slice_head(n = 7) %>%    
  pull(countriesAndTerritories) 

top_mean_fatality
## [1] Liechtenstein Greece        Iceland       Hungary       Bulgaria     
## [6] Slovakia      Croatia      
## 30 Levels: Austria Belgium Bulgaria Croatia Cyprus Czechia Denmark ... Sweden
top_countries_fatality <- data %>%
  filter(countriesAndTerritories %in% top_mean_fatality)
head(top_countries_fatality)
##      dateRep day month year cases deaths countriesAndTerritories geoId
## 1 2022-10-20  20    10 2022  2755      2                  Cyprus    CY
## 2 2022-10-13  13    10 2022  2759      2                  Cyprus    CY
## 3 2022-10-06   6    10 2022  2789      5                  Cyprus    CY
## 4 2022-09-29  29     9 2022  2681      2                  Cyprus    CY
## 5 2022-09-22  22     9 2022  2932      2                  Cyprus    CY
## 6 2022-09-15  15     9 2022  2482      5                  Cyprus    CY
##   countryterritoryCode popData2020 Incidence_rate Fatality_rate
## 1                  CYP      888005       310.2460     0.2252240
## 2                  CYP      888005       310.6964     0.2252240
## 3                  CYP      888005       314.0748     0.5630599
## 4                  CYP      888005       301.9127     0.2252240
## 5                  CYP      888005       330.1783     0.2252240
## 6                  CYP      888005       279.5029     0.5630599
p <- ggplot(top_countries_fatality, aes(x = round_date(as.POSIXct(dateRep), "days"), 
                                 y = Fatality_rate, color = countriesAndTerritories)) +
  geom_line() + 
  xlab("")+
  theme_classic()+
  labs(title = "Fatality Rate From 2020 to 2022",
         x = "Date",
         y = "Fatality Rate")+
  theme(plot.title = element_text(hjust = 0.5, size = 15))
p

Greece has an extremely high fatality rate in 2021.

p <- ggplot(top_countries_fatality, aes(x = countriesAndTerritories, y = Fatality_rate))+
  geom_boxplot()
p

p <- ggplot(top_countries_fatality, aes(x = round_date(as.POSIXct(dateRep), "days"), 
                                 y = log10(Fatality_rate), color = countriesAndTerritories)) +
  geom_line() + 
  xlab("")+
  theme_classic()+
  labs(title = "Fatality Rate From 2020 to 2022 ",
         x = "Date",
         y = "Log Fatality Rate")+
  theme(plot.title = element_text(hjust = 0.5, size = 15))
p

p <- ggplot(top_countries_fatality, aes(x = countriesAndTerritories, y = log10(Fatality_rate)))+
  geom_boxplot()
p

# A table or visualization exploring some other aspect of the data.
str(data)
## 'data.frame':    20518 obs. of  12 variables:
##  $ dateRep                : Date, format: "2022-10-22" "2022-10-21" ...
##  $ day                    : int  22 21 20 19 18 17 16 15 14 13 ...
##  $ month                  : int  10 10 10 10 10 10 10 10 10 10 ...
##  $ year                   : int  2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...
##  $ cases                  : int  5494 7776 8221 10007 13204 9964 6606 8818 11751 13068 ...
##  $ deaths                 : int  4 4 6 8 7 8 12 6 10 14 ...
##  $ countriesAndTerritories: Factor w/ 30 levels "Austria","Belgium",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ geoId                  : Factor w/ 30 levels "AT","BE","BG",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ countryterritoryCode   : Factor w/ 30 levels "AUT","BEL","BGR",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ popData2020            : int  8901064 8901064 8901064 8901064 8901064 8901064 8901064 8901064 8901064 8901064 ...
##  $ Incidence_rate         : num  61.7 87.4 92.4 112.4 148.3 ...
##  $ Fatality_rate          : num  0.0449 0.0449 0.0674 0.0899 0.0786 ...
##  - attr(*, "na.action")= 'omit' Named int [1:385] 1433 1434 1436 1440 1442 1446 1808 1928 1931 1932 ...
##   ..- attr(*, "names")= chr [1:385] "1433" "1434" "1436" "1440" ...
# A visualization exploring new cases, per country, over time.
top_mean_deaths <- data %>%
  group_by(countriesAndTerritories) %>%
  summarise_at(vars(deaths), list(name = mean)) %>%
  arrange(desc(name)) %>%
  slice_head(n = 7) %>%    
  pull(countriesAndTerritories) 

top_mean_deaths
## [1] Italy   France  Germany Poland  Spain   Greece  Hungary
## 30 Levels: Austria Belgium Bulgaria Croatia Cyprus Czechia Denmark ... Sweden
top_countries_deaths <- data %>%
  filter(countriesAndTerritories %in% top_mean_deaths)
head(top_countries_deaths)
##      dateRep day month year cases deaths countriesAndTerritories geoId
## 1 2022-10-24  24    10 2022  6907    122                  France    FR
## 2 2022-10-23  23    10 2022 31470     17                  France    FR
## 3 2022-10-22  22    10 2022 43746     33                  France    FR
## 4 2022-10-21  21    10 2022 49087     81                  France    FR
## 5 2022-10-20  20    10 2022 56793     69                  France    FR
## 6 2022-10-19  19    10 2022 63031    106                  France    FR
##   countryterritoryCode popData2020 Incidence_rate Fatality_rate
## 1                  FRA    67320216       10.25992    0.18122342
## 2                  FRA    67320216       46.74673    0.02525244
## 3                  FRA    67320216       64.98197    0.04901945
## 4                  FRA    67320216       72.91569    0.12032047
## 5                  FRA    67320216       84.36247    0.10249521
## 6                  FRA    67320216       93.62864    0.15745642
p <- ggplot(top_countries_deaths, aes(x = round_date(as.POSIXct(dateRep), "days"), 
                                 y = deaths, color = countriesAndTerritories)) +
  geom_line() + 
  xlab("")+
  theme_classic()+
  labs(title = "Deaths From 2020 to 2022 ",
         x = "Date",
         y = "Deaths")+
  theme(plot.title = element_text(hjust = 0.5, size = 15))
p

#explore case fatality rates per country
case_fatality_rate_summary <- data %>%
  group_by(countriesAndTerritories) %>%
  summarise(case_fatality_rate = sum(deaths)/sum(cases)) %>%
  arrange(desc(case_fatality_rate)) 
case_fatality_rate_summary
## # A tibble: 30 × 2
##    countriesAndTerritories case_fatality_rate
##    <fct>                                <dbl>
##  1 Liechtenstein                      0.0425 
##  2 Bulgaria                           0.0300 
##  3 Hungary                            0.0224 
##  4 Romania                            0.0207 
##  5 Poland                             0.0192 
##  6 Croatia                            0.0138 
##  7 Ireland                            0.0132 
##  8 Malta                              0.0105 
##  9 Czechia                            0.0101 
## 10 Slovakia                           0.00828
## # ℹ 20 more rows

A visualization exploring case fatality rates per country

#A visualization exploring case fatality rates per country
ggplot(case_fatality_rate_summary, aes(x = reorder(countriesAndTerritories, -case_fatality_rate), y = case_fatality_rate, fill = countriesAndTerritories)) +
  geom_bar(stat = "identity") +
  coord_flip() +  
  labs(title = "Case Fatality Rates by Country",
       x = "Country",
       y = "Case Fatality Rate ") +
  theme_minimal() +
  scale_fill_viridis_d()  

2. Inferential Statistics

Select two (2) countries of your choosing and compare their incidence or fatality rates using hypothesis testing. At minimum, your work should include the following:

  • Visualization(s) comparing the daily incidence or fatality rates of the selected countries,
  • A statement of the null hypothesis.
  • A short justification of the statistical test selected.
    • Why is the test you selected an appropriate one for the comparison we’re making?
  • A brief discussion of any distributional assumptions of that test.
    • Does the statistical test we selected require assumptions about our data?
    • If so, does our data satisfy those assumptions?
  • Your selected alpha.
  • The test function output; i.e. the R output.
  • The relevant confidence interval, if not returned by the R test output.
  • A concluding statement on the outcome of the statistical test.
    • i.e. Based on our selected alpha, do we reject or fail to reject our null hypothesis?

Incidence Rate

# Incidence Rate
# Hypothesis Test for Difference in Means

selected_countries <- c("Spain", "Italy")
filtered_data <- data[data$countriesAndTerritories %in% selected_countries,]

ggplot(filtered_data, aes(x = Incidence_rate, fill = countriesAndTerritories)) +
  geom_histogram(position = "dodge", binwidth = 20, alpha = 0.7) +
  labs(title = "Incidence Rate Distributions",
       x = "Incidence rate",
       y = "Frequency") +
  facet_wrap(~ countriesAndTerritories) +
  theme_minimal() +
  theme(legend.position = "none")

Two-sided t-test

Null Hypothesis (H0): There is no difference in the mean incidence rates between Spain and Italy.

Justification: We selected a two-sided t-test for comparing the mean incidence rates of Spain and Italy. This test is appropriate to compare the means of two independent samples.

Assumption Check: The histograms of incidence rates for Spain and Italy above show that neither distribution is perfectly normal. However, both samples are large (N > 30), so by the Central Limit Theorem the sampling distributions of the means are approximately normal and the t-test remains suitable for this analysis.
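A quantile-quantile plot is a quick complementary check on the normality assumption — a sketch, assuming `filtered_data` is defined as above:

```r
# Q-Q plots of incidence rate per country; points far off the reference line
# indicate departures from normality (sketch).
ggplot(filtered_data, aes(sample = Incidence_rate)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~ countriesAndTerritories) +
  theme_minimal()
```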

Selected Alpha: Our selected alpha level is 0.05, which means that we are willing to accept a 5% chance of committing a Type I error when rejecting the null hypothesis.

spain_data <- filtered_data[filtered_data$countriesAndTerritories == 'Spain',]
italy_data <- filtered_data[filtered_data$countriesAndTerritories == 'Italy',]

t.test(spain_data$Incidence_rate, italy_data$Incidence_rate,
       alternative = "two.sided", mu = 0, paired = FALSE,
       var.equal = TRUE, conf.level = 0.95)
## 
##  Two Sample t-test
## 
## data:  spain_data$Incidence_rate and italy_data$Incidence_rate
## t = -1.9531, df = 1789, p-value = 0.05097
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -10.64326657   0.02232826
## sample estimates:
## mean of x mean of y 
##  34.98388  40.29435
conf_interval_spain <- round(t.test(spain_data$Incidence_rate)$conf.int, 2)
conf_interval_italy <- round(t.test(italy_data$Incidence_rate)$conf.int, 2)

print(paste("Confidence Interval for Spain: (", conf_interval_spain[1], ",", conf_interval_spain[2], ")."))
## [1] "Confidence Interval for Spain: ( 31.23 , 38.74 )."
print(paste("Confidence Interval for Italy: (", conf_interval_italy[1], ",", conf_interval_italy[2], ")."))
## [1] "Confidence Interval for Italy: ( 36.55 , 44.04 )."

Conclusion of the t-test: The p-value (0.05097) is slightly larger than 0.05, so we fail to reject the null hypothesis that there is no difference in the mean incidence rates between Spain and Italy.

combined_data_incidence <- rbind(
  data.frame(Incidence_rate = spain_data$Incidence_rate, Country = 'Spain'),
  data.frame(Incidence_rate = italy_data$Incidence_rate, Country = 'Italy')
)

ggplot(combined_data_incidence, aes(x = Incidence_rate, fill = Country)) +
  geom_histogram(position = "dodge", binwidth = 20, alpha = 0.7) +
  geom_vline(data = data.frame(Country = 'Spain', xintercept = conf_interval_spain), aes(xintercept = xintercept, color = Country), linetype = "dashed") +
  geom_vline(data = data.frame(Country = 'Italy', xintercept = conf_interval_italy), aes(xintercept = xintercept, color = Country), linetype = "dashed") +
  scale_color_manual(values = c('Spain' = 'darkorange', 'Italy' = 'darkorange')) +
  labs(title = "Incidence Rate Distributions with Confidence Intervals",
       x = "Incidence rate",
       y = "Frequency") +
  facet_wrap(~ Country) +
  theme_minimal() +
  theme(legend.position = "none")

# Hypothesis Test for Variances
selected_countries <- c("Spain", "Italy")
filtered_data <- data[data$countriesAndTerritories %in% selected_countries,]

ggplot(filtered_data, aes(x = countriesAndTerritories, y = Incidence_rate, fill = countriesAndTerritories)) +
  geom_boxplot() +
  labs(title = "Incidence Rate: Spain vs. Italy",
       x = "Country",
       y = "Incidence Rate") +
  theme_minimal()

ANOVA Test

The null hypothesis (H0): There is no difference in the mean incidence rates between Spain and Italy.

Justification: We selected a one-way Analysis of Variance (ANOVA) to assess whether the mean incidence rates differ between Spain and Italy. ANOVA is appropriate for comparing means across two or more groups; here, the groups are the two countries.

two_countries_data <- data %>%
  filter(countriesAndTerritories %in% c("Spain", "Italy"))

bartlett.test(Incidence_rate ~ countriesAndTerritories, two_countries_data)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  Incidence_rate by countriesAndTerritories
## Bartlett's K-squared = 6.1314, df = 1, p-value = 0.01328

Assumption Check: From the box plots above, the two countries appear to have similar spreads, which would support the assumption of homogeneity of variances. The Bartlett test, however, returns a small p-value (0.01328), so we reject its null hypothesis and conclude that the variances of the two countries are not equal. Violations of normality and homogeneity of variances can affect the accuracy of the results, but ANOVA is reasonably robust, especially with large sample sizes in both groups.
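Because the Bartlett test suggests unequal variances, the pooled-variance assumption used in the earlier t-test (`var.equal = TRUE`) is questionable. A Welch t-test, which R’s `t.test()` performs by default when `var.equal = FALSE`, is a reasonable robustness check — a sketch, assuming `spain_data` and `italy_data` as defined above:

```r
# Welch two-sample t-test: does not assume equal variances (sketch)
t.test(spain_data$Incidence_rate, italy_data$Incidence_rate,
       alternative = "two.sided", var.equal = FALSE, conf.level = 0.95)
```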

Selected Alpha: Our selected alpha level is 0.05.

anova_result <- aov(Incidence_rate ~ countriesAndTerritories, two_countries_data)
summary(anova_result)
##                           Df  Sum Sq Mean Sq F value Pr(>F)  
## countriesAndTerritories    1   12535   12535   3.815  0.051 .
## Residuals               1789 5878832    3286                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion of the Statistical Test: The p-value of 0.051 is slightly larger than 0.05, so we fail to reject the null hypothesis that there is no difference in the mean incidence rates between Spain and Italy.

Fatality Rate

selected_countries <- c("Spain", "Italy")
filtered_data <- data[data$countriesAndTerritories %in% selected_countries,]

ggplot(filtered_data, aes(x = Fatality_rate, fill = countriesAndTerritories)) +
  geom_histogram(position = "dodge", binwidth = 0.1, alpha = 0.7) +
  labs(title = "Fatality Rate Distributions: Spain vs. Italy",
       x = "Fatality rate",
       y = "Frequency") +
  facet_wrap(~ countriesAndTerritories) +
  theme_minimal() +
  theme(legend.position = "none")

Two-sided t-test

Null Hypothesis (H0): There is no difference in the mean fatality rates between Spain and Italy.

Justification: We selected a two-sided t-test to compare the mean fatality rates of Spain and Italy. This test is appropriate for comparing the means of two independent samples.

Assumption Check: Visual inspection of the histograms of the fatality rates for Spain and Italy above shows that neither distribution is perfectly normal. However, both samples are large (N > 30 in each country), so by the central limit theorem the sampling distribution of the mean is approximately normal, and the t-test remains suitable for this analysis.
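A formal complement to the visual inspection is the Shapiro-Wilk test, sketched below on a simulated right-skewed sample (the fatality_sim vector is an illustrative stand-in, not the real Fatality_rate column):

```r
# Synthetic right-skewed sample standing in for one country's daily
# fatality rates (illustrative values only)
set.seed(1)
fatality_sim <- rexp(200, rate = 5)

# Shapiro-Wilk test: H0 is that the sample comes from a normal distribution
sw <- shapiro.test(fatality_sim)
sw
# A small p-value rejects normality; with N > 30 per group, the central
# limit theorem still justifies the t-test on the means.
```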

Selected Alpha: Our selected alpha level is 0.05, which means that we are willing to accept a 5% chance of committing a Type I error when rejecting the null hypothesis.

T-test

spain_data <- filtered_data[filtered_data$countriesAndTerritories == 'Spain',]
italy_data <- filtered_data[filtered_data$countriesAndTerritories == 'Italy',]

t.test(spain_data$Fatality_rate, italy_data$Fatality_rate, alternative = c("two.sided"), mu = 0, paired = FALSE,
       var.equal = TRUE, conf.level = 0.95)
## 
##  Two Sample t-test
## 
## data:  spain_data$Fatality_rate and italy_data$Fatality_rate
## t = -1.2741, df = 1789, p-value = 0.2028
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.05104086  0.01084147
## sample estimates:
## mean of x mean of y 
## 0.2880119 0.3081116
conf_interval_spain_fa <- round(t.test(spain_data$Fatality_rate)$conf.int, 2)
conf_interval_italy_fa <- round(t.test(italy_data$Fatality_rate)$conf.int, 2)

print(paste("Confidence Interval for Spain: (", conf_interval_spain_fa[1], ",", conf_interval_spain_fa[2], ")."))
## [1] "Confidence Interval for Spain: ( 0.27 , 0.31 )."
print(paste("Confidence Interval for Italy: (", conf_interval_italy_fa[1], ",", conf_interval_italy_fa[2], ")."))
## [1] "Confidence Interval for Italy: ( 0.29 , 0.33 )."

Conclusion of the t-test: Given the p-value (0.2028) is greater than the conventional alpha level of 0.05, we fail to reject the null hypothesis. This implies there is no statistically significant difference in the mean daily fatality rates between Spain and Italy, based on the provided data.

# Calculate confidence intervals
calculate_confidence_interval <- function(data, confidence_level = 0.95) {
  n <- length(data)
  mean_value <- mean(data)
  sd_value <- sd(data)
  alpha <- 1 - confidence_level
  critical_value <- qnorm(1 - alpha / 2)
  margin_of_error <- critical_value * (sd_value / sqrt(n))
  
  return(c(mean_value - margin_of_error, mean_value + margin_of_error))
}
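As a quick sanity check of the helper above, the same normal-approximation formula can be verified by hand on a toy vector (values chosen for easy arithmetic):

```r
# Toy vector with easy numbers: mean 3, sd = sqrt(2.5) ≈ 1.581, n = 5
x <- c(1, 2, 3, 4, 5)

# Margin of error: z_0.975 * sd / sqrt(n) ≈ 1.96 * 1.581 / 2.236 ≈ 1.386
moe <- qnorm(0.975) * sd(x) / sqrt(length(x))
ci <- c(mean(x) - moe, mean(x) + moe)
round(ci, 2)  # approximately 1.61 and 4.39
```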

# Create a combined dataframe for plotting
combined_data <- rbind(
  data.frame(Fatality_rate = spain_data$Fatality_rate, Country = 'Spain'),
  data.frame(Fatality_rate = italy_data$Fatality_rate, Country = 'Italy')
)

# Create histograms with confidence intervals
ggplot(combined_data, aes(x = Fatality_rate, fill = Country)) +
  geom_histogram(position = "dodge", binwidth = 0.1, alpha = 0.7) +
  geom_vline(data = data.frame(Country = 'Spain', xintercept = conf_interval_spain_fa), aes(xintercept = xintercept, color = Country), linetype = "dashed") +
  geom_vline(data = data.frame(Country = 'Italy', xintercept = conf_interval_italy_fa), aes(xintercept = xintercept, color = Country), linetype = "dashed") +
  scale_color_manual(values = c('Spain' = 'orange', 'Italy' = 'orange')) +
  labs(title = "Fatality Rate Distributions with Confidence Intervals",
       x = "Fatality rate",
       y = "Frequency") +
  facet_wrap(~ Country) +
  theme_minimal() +
  theme(legend.position = "none")

ANOVA Test

The null hypothesis (H0): There is no difference in the mean fatality rates between Spain and Italy.

Justification: We selected the analysis of variance (ANOVA) test to assess whether the mean fatality rates of Spain and Italy differ. ANOVA is appropriate for comparing means across two or more groups; here the two groups are the countries Spain and Italy.

# Box Plot for variance comparison
ggplot(filtered_data, aes(x = countriesAndTerritories, y = Fatality_rate, fill = countriesAndTerritories)) +
  geom_boxplot() +
  labs(title = "Variance of Fatality Rate: Spain vs Italy",
       x = "Country",
       y = "Fatality Rate") +
  theme_minimal()

two_countries_data <- data %>%
  filter(countriesAndTerritories %in% c("Spain", "Italy"))

bartlett.test(Fatality_rate ~ countriesAndTerritories, two_countries_data)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  Fatality_rate by countriesAndTerritories
## Bartlett's K-squared = 0.19495, df = 1, p-value = 0.6588

Assumption Check: The box plots above show that the two countries have similar spreads, supporting the assumption of homogeneity of variances. The Bartlett test p-value of 0.6588 is greater than the standard alpha level of 0.05, so we do not reject its null hypothesis of equal variances; this supports the use of ANOVA here. Violations of normality and variance homogeneity could impair the accuracy of the results, but ANOVA remains robust, particularly with the large sample sizes in both countries.

Selected Alpha: Our selected alpha level is 0.05.

# ANOVA test
anova_result <- aov(Fatality_rate ~ countriesAndTerritories, data = filtered_data)
summary(anova_result)
##                           Df Sum Sq Mean Sq F value Pr(>F)
## countriesAndTerritories    1   0.18  0.1796   1.623  0.203
## Residuals               1789 197.90  0.1106

Conclusion of the Statistical Test: The p-value (0.203) is greater than the standard alpha level (0.05), so we fail to reject the null hypothesis. This suggests there is no statistically significant difference in the mean fatality rates between the two countries.


3. Correlation

Considering all countries, explore the relationship between incidence rates and fatality rates. At minimum, your work should include the following:

  • Visualization(s) showing the distributions of daily incidence and fatality rates, regardless of country. Please note that both country and date should be disregarded here.
  • A short statement identifying the most appropriate correlation coefficient.
    • For the correlation we’re interested in, which correlation coefficient is most appropriate?
    • Why do you find the correlation coefficient selected to be the most appropriate?
  • The calculated correlation coefficient or coefficient test output; e.g. cor() or cor.test().
# Histogram for Daily Incidence Rates
ggplot(data, aes(x = Incidence_rate)) +
  geom_histogram(bins = 35, fill = "lightblue", alpha = 0.7) +  # alpha must be in [0, 1]
  labs(title = "Distribution of Daily Incidence Rates", x = "Incidence Rate", y = "Frequency")

# Histogram for Daily Fatality Rates
ggplot(data, aes(x = Fatality_rate)) +
  geom_histogram(bins = 30, fill = "purple", alpha = 0.7) +
  labs(title = "Distribution of Daily Fatality Rates", x = "Fatality Rate", y = "Frequency")

data <- na.omit(data)  # drop rows with missing values before computing correlations

pearson_corr <- cor(data$Incidence_rate, data$Fatality_rate, method = "pearson")
print(paste("Pearson Correlation Coefficient:", pearson_corr))
## [1] "Pearson Correlation Coefficient: 0.0950094002591344"
spearman_corr <- cor(data$Incidence_rate, data$Fatality_rate, method = "spearman")
print(paste("Spearman Correlation Coefficient:", spearman_corr))
## [1] "Spearman Correlation Coefficient: 0.56195982704042"

Answer: The distributions of the daily incidence and fatality rates above do not follow normal distributions. The incidence rates are right-skewed, with most observations at lower rates and a long tail of very high rates; the fatality rates show a similar shape. Because Pearson's coefficient assumes an approximately linear relationship and is sensitive to skew and outliers, Spearman's rank correlation is the more appropriate choice here: it measures the monotonic association between incidence and fatality rates and provides a more reliable measure under these non-normal conditions.
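Beyond cor(), cor.test() attaches a significance test to the chosen coefficient. A sketch on synthetic skewed data follows (the incidence_sim and fatality_sim vectors are simulated stand-ins, not the real columns):

```r
# Synthetic right-skewed pair with a monotone but nonlinear relationship,
# imitating the shape of the incidence and fatality columns
set.seed(7)
incidence_sim <- rlnorm(300, meanlog = 3, sdlog = 1)
fatality_sim  <- sqrt(incidence_sim) + rnorm(300, sd = 2)

# Spearman rank correlation with a significance test; exact = FALSE avoids
# the exact-p warning when ties are present
ct <- cor.test(incidence_sim, fatality_sim, method = "spearman", exact = FALSE)
ct
```

On the real data frame, the same call with data$Incidence_rate and data$Fatality_rate would test whether the Spearman coefficient of 0.562 is significantly different from zero.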

4. Regression

Here, we will fit a model on data from twenty (20) countries considering total new cases as a function of population, population density and gross domestic product (GDP) per capita. Note that the GDP per capita is given in “purchasing power standard,” which considers the costs of goods and services in a country relative to incomes in that country; i.e. we will consider this as appropriately standardized.

Code is given below defining a new data frame, 'model_df,' which provides the total area and standardized GDP per capita for the twenty (20) countries for our model fit. You are responsible for creating a vector of the total new cases across the time frame of the dataset, for each of those countries, and adding that vector to our 'model_df' data frame.

# The code below creates a new data frame, 'model_df,' that includes the area,
# GDP per capita, population and population density for the twenty (20)
# countries of interest. All you should need to do is execute this code, as is.

# You do not need to add code in this chunk. You will need to add code in the
# 'regression_b,' 'regression_c' and 'regression_d' code chunks.

twenty_countries <- c("Austria", "Belgium", "Bulgaria", "Cyprus", "Denmark",
                      "Finland", "France", "Germany", "Hungary", "Ireland",
                      "Latvia", "Lithuania", "Malta", "Norway", "Poland",
                      "Portugal", "Romania", "Slovakia", "Spain", "Sweden")

sq_km <- c(83858, 30510, 110994, 9251, 44493, 338145, 551695, 357386, 93030,
           70273, 64589, 65300, 316, 385178, 312685, 88416, 238397, 49036,
           498511, 450295)

gdp_pps <- c(128, 118, 51, 91, 129, 111, 104, 123, 71, 190, 69, 81, 100, 142,
             71, 78, 65, 71, 91, 120)

model_df <- data %>%
  select(c(countriesAndTerritories, popData2020)) %>%
  filter(countriesAndTerritories %in% twenty_countries) %>%
  distinct(countriesAndTerritories, .keep_all = TRUE) %>%
  add_column(sq_km, gdp_pps) %>%
  mutate(pop_dens = popData2020 / sq_km) %>%
  rename(country = countriesAndTerritories, pop = popData2020)

Next, we need to add one (1) more column to our ‘model_df’ data frame. Specifically, one that has the total number of new cases for each of the twenty (20) countries. We calculate the total number of new cases by summing all the daily new cases, for each country, across all the days in the dataset.


total_cases <- data %>%
  select(c(countriesAndTerritories, cases)) %>%
  group_by(countriesAndTerritories) %>%
  dplyr::summarize(total_cases = sum(cases, na.rm = TRUE)) %>%
  filter(countriesAndTerritories %in% twenty_countries) %>%
  select(total_cases)

model_df <- model_df %>%
  add_column(total_cases)
model_df
##      country      pop  sq_km gdp_pps   pop_dens total_cases
## 1    Austria  8901064  83858     128  106.14448     5387997
## 2    Belgium 11522440  30510     118  377.66109     4597870
## 3   Bulgaria  6951482 110994      51   62.62935     1260132
## 4     Cyprus   888005   9251      91   95.99016      519761
## 5    Denmark  5822763  44493     129  130.86919     3163687
## 6    Finland  5525292 338145     111   16.34001     1305112
## 7     France 67320216 551695     104  122.02434    36942541
## 8    Germany 83166711 357386     123  232.70836    35284167
## 9    Hungary  9769526  93030      71  105.01479     2138099
## 10   Ireland  4964440  70273     190   70.64506      607291
## 11    Latvia  1907675  64589      69   29.53560      902622
## 12 Lithuania  2794090  65300      81   42.78851     1241739
## 13     Malta   514564    316     100 1628.36709       76852
## 14    Norway  5367580 385178     142   13.93532      549922
## 15    Poland 37958138 312685      71  121.39418     6157508
## 16  Portugal 10295909  88416      78  116.44848     5505030
## 17   Romania 19328838 238397      65   81.07836     3245139
## 18  Slovakia  5457873  49036      71  111.30339     2475792
## 19     Spain 47332614 498511      91   94.94798    13561646
## 20    Sweden 10327589 450295     120   22.93516     2585009

Now, we will fit our model using the data in ‘model_df.’ We are interested in explaining total cases (response) as a function of population (explanatory), population density (explanatory), and GDP (explanatory).

At minimum, your modeling work should including the following:

  • A description - either narrative or using R output - of your ‘model_df’ data frame.
    • Consider: what data types are present? What do our rows and columns represent?
  • The lm() summary() output of your fitted model. As we did in the second Data Analysis Assignment, you can pass your fitted model object - i.e. the output of lm() - to summary() and get additional details, including R^2, on your model fit.
  • A short statement on the fit of the model.
    • Which, if any, of our coefficients are statistically significant?
    • What is the R^2 of our model?
    • Should we consider a reduced model; i.e. one with fewer parameters?

Description of model_df

# Description of model_df
str(model_df)
## 'data.frame':    20 obs. of  6 variables:
##  $ country    : Factor w/ 30 levels "Austria","Belgium",..: 1 2 3 5 7 9 10 11 13 15 ...
##  $ pop        : int  8901064 11522440 6951482 888005 5822763 5525292 67320216 83166711 9769526 4964440 ...
##  $ sq_km      : num  83858 30510 110994 9251 44493 ...
##  $ gdp_pps    : num  128 118 51 91 129 111 104 123 71 190 ...
##  $ pop_dens   : num  106.1 377.7 62.6 96 130.9 ...
##  $ total_cases: int  5387997 4597870 1260132 519761 3163687 1305112 36942541 35284167 2138099 607291 ...
##  - attr(*, "na.action")= 'omit' Named int [1:385] 1433 1434 1436 1440 1442 1446 1808 1928 1931 1932 ...
##   ..- attr(*, "names")= chr [1:385] "1433" "1434" "1436" "1440" ...
summary(model_df)
##      country        pop               sq_km           gdp_pps     
##  Austria : 1   Min.   :  514564   Min.   :   316   Min.   : 51.0  
##  Belgium : 1   1st Qu.: 5266795   1st Qu.: 60701   1st Qu.: 71.0  
##  Bulgaria: 1   Median : 7926273   Median : 90723   Median : 95.5  
##  Cyprus  : 1   Mean   :17305840   Mean   :192118   Mean   :100.2  
##  Denmark : 1   3rd Qu.:13474040   3rd Qu.:342955   3rd Qu.:120.8  
##  Finland : 1   Max.   :83166711   Max.   :551695   Max.   :190.0  
##  (Other) :14                                                      
##     pop_dens        total_cases      
##  Min.   :  13.94   Min.   :   76852  
##  1st Qu.:  57.67   1st Qu.: 1156960  
##  Median : 100.50   Median : 2530400  
##  Mean   : 179.14   Mean   : 6375396  
##  3rd Qu.: 121.55   3rd Qu.: 5417255  
##  Max.   :1628.37   Max.   :36942541  
## 

Answer: The 'model_df' data frame contains the following variables:

  • country: categorical, stored as a factor; although only 20 countries appear here, the factor retains the 30 levels of the source data.
  • pop: population count for each country (integer).
  • sq_km: land area in square kilometers (numeric).
  • gdp_pps: GDP per capita expressed in purchasing power standards (numeric), which adjusts for differing prices and income levels across countries.
  • pop_dens: population density (numeric), calculated as population divided by land area in square kilometers.
  • total_cases: the cumulative number of new daily COVID-19 cases reported in each country (integer).

The dataset’s structure indicates that there are 20 records and 6 features. Each row in the data frame represents a summary of observations for different countries. The columns contain distinct attributes: the name of the country, its population, area in square kilometers, GDP per capita adjusted for purchasing power, population density, and the total number of COVID-19 cases.

Correlation Analysis

model_df
##      country      pop  sq_km gdp_pps   pop_dens total_cases
## 1    Austria  8901064  83858     128  106.14448     5387997
## 2    Belgium 11522440  30510     118  377.66109     4597870
## 3   Bulgaria  6951482 110994      51   62.62935     1260132
## 4     Cyprus   888005   9251      91   95.99016      519761
## 5    Denmark  5822763  44493     129  130.86919     3163687
## 6    Finland  5525292 338145     111   16.34001     1305112
## 7     France 67320216 551695     104  122.02434    36942541
## 8    Germany 83166711 357386     123  232.70836    35284167
## 9    Hungary  9769526  93030      71  105.01479     2138099
## 10   Ireland  4964440  70273     190   70.64506      607291
## 11    Latvia  1907675  64589      69   29.53560      902622
## 12 Lithuania  2794090  65300      81   42.78851     1241739
## 13     Malta   514564    316     100 1628.36709       76852
## 14    Norway  5367580 385178     142   13.93532      549922
## 15    Poland 37958138 312685      71  121.39418     6157508
## 16  Portugal 10295909  88416      78  116.44848     5505030
## 17   Romania 19328838 238397      65   81.07836     3245139
## 18  Slovakia  5457873  49036      71  111.30339     2475792
## 19     Spain 47332614 498511      91   94.94798    13561646
## 20    Sweden 10327589 450295     120   22.93516     2585009
model_df[c(2,4:6)]
##         pop gdp_pps   pop_dens total_cases
## 1   8901064     128  106.14448     5387997
## 2  11522440     118  377.66109     4597870
## 3   6951482      51   62.62935     1260132
## 4    888005      91   95.99016      519761
## 5   5822763     129  130.86919     3163687
## 6   5525292     111   16.34001     1305112
## 7  67320216     104  122.02434    36942541
## 8  83166711     123  232.70836    35284167
## 9   9769526      71  105.01479     2138099
## 10  4964440     190   70.64506      607291
## 11  1907675      69   29.53560      902622
## 12  2794090      81   42.78851     1241739
## 13   514564     100 1628.36709       76852
## 14  5367580     142   13.93532      549922
## 15 37958138      71  121.39418     6157508
## 16 10295909      78  116.44848     5505030
## 17 19328838      65   81.07836     3245139
## 18  5457873      71  111.30339     2475792
## 19 47332614      91   94.94798    13561646
## 20 10327589     120   22.93516     2585009
library(kableExtra)
library(tidyr)
correl <- cor(model_df[c(2,4:6)])
correl %>%
  kbl() %>%
  kable_styling()
                    pop    gdp_pps   pop_dens  total_cases
pop           1.0000000  0.0235758 -0.0813254    0.9428576
gdp_pps       0.0235758  1.0000000  0.0205849    0.0918025
pop_dens     -0.0813254  0.0205849  1.0000000   -0.0493727
total_cases   0.9428576  0.0918025 -0.0493727    1.0000000
pairs(model_df[c(2,4:6)], col = "blue", main = "Correlation Matrix")

Correlation: There is a strong positive correlation between 'pop' and 'total_cases' (0.943), suggesting that countries with larger populations tend to report more COVID-19 cases. The remaining pairwise correlations are near zero, indicating little linear relationship among the other variables.

Generic Model with all variables

model1 <- lm(total_cases ~ pop+gdp_pps+pop_dens, model_df)
summary(model1)
## 
## Call:
## lm(formula = total_cases ~ pop + gdp_pps + pop_dens, data = model_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -8441289 -1186145    55312  1722407  8943757 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.444e+06  2.828e+06  -1.218    0.241    
## pop          4.316e-01  3.729e-02  11.574 3.47e-09 ***
## gdp_pps      2.206e+04  2.596e+04   0.850    0.408    
## pop_dens     7.848e+02  2.467e+03   0.318    0.755    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3760000 on 16 degrees of freedom
## Multiple R-squared:  0.8945, Adjusted R-squared:  0.8747 
## F-statistic: 45.22 on 3 and 16 DF,  p-value: 4.881e-08

STEPWISE Regression

The full model has an adjusted R^2 of 0.8747. To produce a more parsimonious model, we will need to drop some of the variables.

The method we chose for this was stepwise regression using the AIC criterion.

library(MASS)
stepAIC(model1)
## Start:  AIC=609.13
## total_cases ~ pop + gdp_pps + pop_dens
## 
##            Df  Sum of Sq        RSS    AIC
## - pop_dens  1 1.4304e+12 2.2760e+14 607.26
## - gdp_pps   1 1.0204e+13 2.3638e+14 608.01
## <none>                   2.2617e+14 609.13
## - pop       1 1.8938e+15 2.1199e+15 651.89
## 
## Step:  AIC=607.26
## total_cases ~ pop + gdp_pps
## 
##           Df  Sum of Sq        RSS    AIC
## - gdp_pps  1 1.0382e+13 2.3799e+14 606.15
## <none>                  2.2760e+14 607.26
## - pop      1 1.8980e+15 2.1256e+15 649.94
## 
## Step:  AIC=606.15
## total_cases ~ pop
## 
##        Df  Sum of Sq        RSS    AIC
## <none>               2.3799e+14 606.15
## - pop   1 1.9057e+15 2.1436e+15 648.11
## 
## Call:
## lm(formula = total_cases ~ pop, data = model_df)
## 
## Coefficients:
## (Intercept)          pop  
##  -1.089e+06    4.313e-01

Based on the results of the stepwise regression, the model containing only the population variable has the lowest AIC (606.15), indicating the best fit by that criterion. We therefore refit the model using population alone.

model2 <- lm(total_cases ~ pop, model_df)
summary(model2)
## 
## Call:
## lm(formula = total_cases ~ pop, data = model_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -9126129  -702074   608600  1214732  8993753 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.089e+06  1.024e+06  -1.064    0.301    
## pop          4.313e-01  3.593e-02  12.006 5.01e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3636000 on 18 degrees of freedom
## Multiple R-squared:  0.889,  Adjusted R-squared:  0.8828 
## F-statistic: 144.1 on 1 and 18 DF,  p-value: 5.009e-10

Additionally, we fit Model 3 with population and GDP, since 'total_cases ~ pop + gdp_pps' was the second-best model (AIC 607.26) in the stepwise search.

model3 <- lm(total_cases ~ pop+gdp_pps, model_df)
summary(model3)
## 
## Call:
## lm(formula = total_cases ~ pop + gdp_pps, data = model_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -8461112 -1323197   377232  1619448  8946779 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.305e+06  2.719e+06  -1.216    0.241    
## pop          4.306e-01  3.616e-02  11.906 1.13e-09 ***
## gdp_pps      2.224e+04  2.526e+04   0.881    0.391    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3659000 on 17 degrees of freedom
## Multiple R-squared:  0.8938, Adjusted R-squared:  0.8813 
## F-statistic: 71.56 on 2 and 17 DF,  p-value: 5.263e-09

MSE comparison

mean(model2$residuals^2)
## [1] 1.189929e+13
mean(model3$residuals^2)
## [1] 1.138018e+13
par(mfrow=c(1,2))
hist(model2$residuals, main = "Histogram of Residuals", col = "red", freq = FALSE, ylab = "Density")

qqnorm(model2$residuals, main = "Q-Q Plot of Residuals", col = "red", pch = 16)
qqline(model2$residuals, col = "green", lty = 2, lwd = 2)

par(mfrow=c(1,2))
hist(model3$residuals, main = "Histogram of Residuals", col = "red", freq = FALSE, ylab = "Density")

qqnorm(model3$residuals, main = "Q-Q Plot of Residuals", col = "red", pch = 16)
qqline(model3$residuals, col = "green", lty = 2, lwd = 2)

The Best Fitting model

The stepwise method recommends model 2, which includes only population, and its adjusted R^2 (0.8828) is slightly higher than that of the initial full model (0.8747). We also fit model 3 with population and GDP as predictors; it has a similar overall p-value and adjusted R^2 (0.8813). Believing GDP may still carry useful information for predicting total cases, we select model 3, with population and GDP, as our best model.
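A formal way to arbitrate between model 2 and model 3 is a partial F-test on the nested pair via anova(). Sketched below on simulated data shaped like model_df (the column names match, but the values, and hence the resulting F statistic, are synthetic):

```r
# Synthetic stand-in for model_df: total cases driven mostly by population
set.seed(401)
sim_df <- data.frame(pop = runif(20, 5e5, 8e7),
                     gdp_pps = runif(20, 50, 190))
sim_df$total_cases <- 0.43 * sim_df$pop + rnorm(20, sd = 3e6)

m2 <- lm(total_cases ~ pop, data = sim_df)            # reduced model
m3 <- lm(total_cases ~ pop + gdp_pps, data = sim_df)  # full model

# Partial F-test: H0 says the extra gdp_pps coefficient is zero;
# a large p-value favors the reduced model
cmp <- anova(m2, m3)
cmp
```

Running anova(model2, model3) on the real model_df would give the corresponding test for our fitted pair.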

EDA for the identified predictors

p1 <- ggplot(model_df, aes(x = pop, y = total_cases)) +
  geom_point(color = "blue") +
  ggtitle("Population vs Total Cases") +
  theme_minimal()


p2 <- ggplot(model_df, aes(x = gdp_pps, y = total_cases)) +
  geom_point(color = "purple") +
  ggtitle("GDP vs Total Cases") +
  theme_minimal()


grid.arrange(p1, p2, ncol = 2)

qq = ggplot(model_df, aes(sample = total_cases))+
  geom_qq(color = "orange")+
  geom_qq_line()+
  labs(title = "Normality for response", x = "Theoretical Quantiles", y = "Sample Quantiles")
qq 

Reviewing Residuals

Although a normality check on the response variable is optional and adds little to the analysis, a normality check on the residuals does add value.

We review the residuals closely to identify whether any predictors create issues and whether there are outliers. In this step, we compare the residuals of model 2 and model 3.

r <- residuals(model3)
par(mfrow = c(1, 1))  # reset to a single plotting panel
hist(r, breaks = 30, main = "Residuals distribution", col = "red")

plot(r,main = "Fitted Residuals", col = "red")
abline(h=0)

library(ggplot2)
library(moments)
par(mfrow = c(1, 2))
hist(r, main = "Histogram of Residuals", col = "red", freq = FALSE, ylab = "Density")
curve(dnorm(x, 0, sd(r)), add = TRUE, col = "green", lwd = 2)  # overlay normal density

qqnorm(r, main = "Q-Q Plot of Residuals", col = "red", pch = 16)
qqline(r, col = "green", lty = 2, lwd = 2)

r_2 <- residuals(model2)
par(mfrow = c(1, 2))
hist(r_2, main = "Histogram of Residuals", col = "red", freq = FALSE, ylab = "Density")
curve(dnorm(x, 0, sd(r_2)), add = TRUE, col = "green", lwd = 2)  # overlay normal density

qqnorm(r_2, main = "Q-Q Plot of Residuals", col = "red", pch = 16)
qqline(r_2, col = "green", lty = 2, lwd = 2)

When comparing the residuals of model 2 and model 3, we find that both sets of residuals are approximately normally distributed, with most points lying on the Q-Q line. However, model 3 is the better-fitting model because its residuals follow the Q-Q line more closely than those of model 2.

Durbin-Watson Autocorrelation Test

Below, we test the residuals for autocorrelation.

library(car)
dwt(model3)
##  lag Autocorrelation D-W Statistic p-value
##    1       -0.116894      2.209375   0.634
##  Alternative hypothesis: rho != 0

In summary, the Durbin-Watson test result suggests that there is no significant autocorrelation in the residuals of the regression model model3, as the D-W statistic is close to 2, and the p-value is much higher than 0.05. Therefore, the null hypothesis of no autocorrelation cannot be rejected.

Detecting residual patterns (model 3)

# detect residual pattern
residual_pop = ggplot(model_df,aes(x=pop, y=r))+
  geom_point()+
  labs(title = "Residuals versus population", x = "population", y = "Residuals" )

residual_gdp = ggplot(model_df,aes(x=gdp_pps, y=r))+
  geom_point()+
  labs(title = "Residuals versus GDP", x = "GDP", y = "Residuals" )


grid.arrange(residual_pop,residual_gdp,ncol = 2 )

Conclusion: The residual plots show no trend, suggesting that the residuals are not related to the explanatory variables.

The last thing we will do is use our model to predict the total new cases of two (2) countries not included in our model fit. At minimum, your work should include:

  • The predicted total new cases for both countries.
  • The actual total new cases for both countries.
  • A short statement on the performance of the model in these two (2) cases.
    • Compare the new predictions to those made on the fitted dataset. You may compare the predicted values or the residuals.
# The code below defines our 'newdata' data frame for applying our model to the
# population, population density and GDP per capita of two (2) countries not
# included in the model fit. Please execute the code as given.

newdata <- data.frame(country = c("Luxembourg", "Netherlands"),
                      pop = c(626108, 17407585),
                      gdp_pps = c(261, 130),
                      pop_dens = c(626108, 17407585) / c(2586, 41540))

# Add code here returning the actual  total cases from our dataset for the
# Netherlands and Luxembourg.

actual_data <- data %>% 
  filter(countriesAndTerritories  %in% c("Netherlands", "Luxembourg")) %>% 
  group_by(countriesAndTerritories) %>%
  dplyr::summarize(total_cases = sum(cases, na.rm = TRUE))

print(actual_data)
## # A tibble: 2 × 2
##   countriesAndTerritories total_cases
##   <fct>                         <int>
## 1 Luxembourg                   200079
## 2 Netherlands                 8399426
# Add code here returning the total cases for the Netherlands and Luxembourg
# predicted by our model.
# Predicting total cases using the model
predicted_cases <- predict(model3, newdata)

# Adding predictions to the newdata data frame
newdata$predicted_cases <- predicted_cases

# Viewing the results
print(newdata)
##       country      pop gdp_pps pop_dens predicted_cases
## 1  Luxembourg   626108     261 242.1145         2769977
## 2 Netherlands 17407585     130 419.0560         7082063
Performance of the model: For the Netherlands, the model performs reasonably well, predicting about 7.08 million total cases against an actual 8.40 million, an underprediction of roughly 16%. For Luxembourg, the model performs poorly, predicting about 2.77 million cases against an actual 200,079, an overprediction of more than tenfold; Luxembourg's GDP per capita (261) lies well above the range used to fit the model (51 to 190), so the positive gdp_pps coefficient extrapolates badly. In absolute terms, both residuals (about -2.57 million for Luxembourg and +1.32 million for the Netherlands) fall within the range of residuals seen on the fitted data, but in relative terms the Luxembourg prediction is far worse than anything observed during fitting.

Log Transformation

Since the distribution of our dependent variable, total_cases, is skewed to the right, we want to see whether making it more normally distributed also makes the residuals more normally distributed, thus improving the model's predictive power.

The log transformation is arguably the most popular of the transformations used to bring skewed data approximately into conformance with normality. As the histograms before and after the log transformation show, the data are significantly less skewed after the transformation.
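The before/after effect can be quantified with a skewness statistic; a sketch on synthetic lognormal counts follows (the cases_sim values are simulated, not the real total_cases, and the skew() helper is hand-rolled, approximating moments::skewness() up to a small-sample factor):

```r
# Hand-rolled (approximate) sample skewness
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Synthetic lognormal counts standing in for the right-skewed total_cases
set.seed(99)
cases_sim <- rlnorm(200, meanlog = 14, sdlog = 1)

skew_before <- skew(cases_sim)       # strongly positive (right-skewed)
skew_after  <- skew(log(cases_sim))  # near zero after the log transform
c(before = skew_before, after = skew_after)
```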

log_model = model_df
log_model = log_model[c(1:2,4:6)]
par(mfrow = c(1,2))
hist(log_model$total_cases, main = "Total Cases Before Log Transformation", xlab = "Total Cases", col= "lightblue")
hist(log(log_model$total_cases), main = "Total Cases After Log Transformation", xlab = "Log Total Cases", col = "lightblue")

Correlation table after log transformation: note that 'log_model' still holds the untransformed columns (the log was applied only inside the histogram call above), so the correlations below are identical to those computed before the transformation.

correl <- cor(log_model[2:5])
correl %>%
  kbl() %>%
  kable_styling()
                    pop    gdp_pps   pop_dens  total_cases
pop           1.0000000  0.0235758 -0.0813254    0.9428576
gdp_pps       0.0235758  1.0000000  0.0205849    0.0918025
pop_dens     -0.0813254  0.0205849  1.0000000   -0.0493727
total_cases   0.9428576  0.0918025 -0.0493727    1.0000000
pairs(log_model[c(2:5)], col = "blue", main = "Correlation Matrix")

Step AIC:

model_log = lm(total_cases~pop+gdp_pps+pop_dens, log_model)
stepAIC(model_log)
## Start:  AIC=609.13
## total_cases ~ pop + gdp_pps + pop_dens
## 
##            Df  Sum of Sq        RSS    AIC
## - pop_dens  1 1.4304e+12 2.2760e+14 607.26
## - gdp_pps   1 1.0204e+13 2.3638e+14 608.01
## <none>                   2.2617e+14 609.13
## - pop       1 1.8938e+15 2.1199e+15 651.89
## 
## Step:  AIC=607.26
## total_cases ~ pop + gdp_pps
## 
##           Df  Sum of Sq        RSS    AIC
## - gdp_pps  1 1.0382e+13 2.3799e+14 606.15
## <none>                  2.2760e+14 607.26
## - pop      1 1.8980e+15 2.1256e+15 649.94
## 
## Step:  AIC=606.15
## total_cases ~ pop
## 
##        Df  Sum of Sq        RSS    AIC
## <none>               2.3799e+14 606.15
## - pop   1 1.9057e+15 2.1436e+15 648.11
## 
## Call:
## lm(formula = total_cases ~ pop, data = log_model)
## 
## Coefficients:
## (Intercept)          pop  
##  -1.089e+06    4.313e-01

Step AIC again selected population alone. As an alternative, we fit a model with population and population density; however, its adjusted R^2 and p-values are not as good as model 3's.

model_log1 = lm(total_cases~pop+pop_dens, log_model)
summary(model_log1)
## 
## Call:
## lm(formula = total_cases ~ pop + pop_dens, data = log_model)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -9099198  -566886   273430  1310045  8990131 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.256e+06  1.159e+06  -1.084    0.293    
## pop          4.324e-01  3.697e-02  11.696 1.49e-09 ***
## pop_dens     8.321e+02  2.446e+03   0.340    0.738    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3729000 on 17 degrees of freedom
## Multiple R-squared:  0.8897, Adjusted R-squared:  0.8768 
## F-statistic: 68.58 on 2 and 17 DF,  p-value: 7.259e-09

We want to check the normality of the residuals. The Shapiro-Wilk tests below show that model 3 barely passes the normality check (p = 0.058 > 0.05), while the residuals of the log model fail it (p = 0.011 < 0.05).

shapiro.test(model_log1$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_log1$residuals
## W = 0.86763, p-value = 0.01068
shapiro.test(model3$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model3$residuals
## W = 0.90771, p-value = 0.05766

Visualized comparison between the distribution of model3 and log model

par(mfrow = c(2,2))
hist(model_log1$residuals, main = "Log Model Residuals", col="lightblue")
qqnorm(model_log1$residuals)
hist(model3$residuals, main = "Model 3 Residuals", col = "lightpink")
qqnorm(model3$residuals)

From the histograms and Q-Q plots, the residuals of the log model may appear somewhat more symmetric than those of model 3, but the Shapiro-Wilk tests above point the other way. Since model 3 passes the normality check and performs better in the other areas, such as R^2 and predictive accuracy, model 3 is still our best performing model.